Abstract

We study the statistical properties of large random networks with
specified degree distributions. New techniques are presented for
analyzing the structure of social networks. Specifically, we address the
question of how many nodes exist at a given distance from a given node.
We also explore the degree distribution of nodes at a given distance
from a given node. Implications for network sampling and diffusion
on social networks are described.



1 Introduction 

Random network models have a long history in the social networks literature.
Rapoport et al. were the first to propose random graphs as models of social
networks [?], while simultaneously the basic theory of random graphs
was established in the mathematics literature by Erdős et al. [?]. Thereafter,
periodic efforts were made to specify in greater detail the random or
statistical nature of social networks, for example with the biased random net
theory of Frank [?], Skvoretz [23], Fararo [?], and others.

More recently, significant contributions have been made by statistical
physicists, especially regarding the aggregate statistical attributes of
networks [?]. The degree distribution has been shown to be one of
the most important features of a network in determining network structure.

*The Cornell email network was provided by Cornell Information Technologies (CIT). 
Special thanks to Jim Howell and Don Macleod at CIT for help preparing the data. Thanks 
to Matt Salganik, Douglas Heckathorn, and Stephen Strogatz for valuable comments. 

†Department of Sociology, Cornell University, email: emv7@cornell.edu

Consequently, random networks with specified degree distributions have been
proposed as a model of large, complex social networks [?].

In this article, we describe techniques for revealing subtle aspects of
network structure, taking as given a certain degree distribution. Our method
relies on network tomography [11], the idea of mapping out a network layer
by layer from a single node. The method is described in section 2 below.

The appropriateness of the random graph model will vary from population
to population. Certainly a degree distribution does not determine
the overall structure of a network. It is possible for a network with a given
degree sequence to have extreme differences from a corresponding random
network [?]. But even in such cases, differences are likely to be
informative, suggesting unique mechanisms that move a network away from the
random regime.

This work has implications for network sampling, the study of diffusion
and mathematical epidemiology, as well as other dynamic processes on
networks. All of these problems involve the marriage of network structure
with network dynamics. To answer dynamical questions, it is desirable to
specify network structure with greater precision. Unfortunately, even in
random networks of the type studied here, namely semi-random networks with
given degree distributions, there are many topological questions which
remain unanswered. We will focus on two: 1. How many individuals are there
at any distance from a given node? 2. Among all nodes at a given distance,
what is the degree distribution among those nodes? Example applications
are further described in the discussion (section 5).

2 Network tomography 

In all that follows, we assume a network size $n$ and a degree distribution
$p_k$ (the probability of a node being degree $k$ is $p_k$). Multiple connections
and loops are allowed; however, it should be noted that such connections
are exceedingly rare for large $n$. Our networks are undirected. Connections
within the network are entirely random but for these constraints.
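One common way to build a network of this kind is stub matching: each node receives as many half-edges ("stubs") as its degree, and stubs are paired uniformly at random. The sketch below is illustrative only; the function name `configuration_model` and the small degree sequence are assumptions made for the example, and the text does not say which construction was used for the simulations reported later.

```python
import random

def configuration_model(degrees, seed=None):
    """Pair stubs uniformly at random; self-loops and multi-edges are kept."""
    if sum(degrees) % 2 != 0:
        raise ValueError("the degree sequence must have an even sum")
    rng = random.Random(seed)
    # One stub (half-edge) per unit of degree, labelled by its node.
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    # Consecutive stubs in the shuffled list form the edges.
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

degrees = [3, 2, 2, 1, 1, 1]      # hypothetical degree sequence
edges = configuration_model(degrees, seed=1)
```

Because loops and repeated connections are retained, every node ends up with exactly its prescribed degree, which matches the assumption in the text that such connections are allowed.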

Having constructed such a network, we can carry out the following thought
experiment. Pick a node $v_0$ uniformly at random within the giant component
of the network¹. We will call $v_0$ the seed. This node will have a degree $\geq 1$,

1 A component in a network is a maximal set of nodes such that there exists a path
between any two of them. A giant component is a component which occupies a fraction

and a number of neighbors at distance one. Those nodes in turn will have
a degree distribution specific to themselves, and a number of connections to
other nodes at distance two from $v_0$. We can continue in this way, eventually
breaking the entire giant component into disjoint sets defined by the distance
from our seed. Some nodes may not be enumerated in this way, in which
event they fall outside of the giant component.

What we have just described is the basic premise of network tomography. Network
tomography, originally described in [11], is a method for revealing the
structure of a random network by exploration, layer by layer, from a single starting
node.
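On a concrete network, the layer-by-layer exploration amounts to a breadth-first search from the seed. The following sketch assumes the network is stored as an adjacency dictionary; the name `layers_from_seed` and the toy graph are hypothetical and serve only to illustrate the idea.

```python
from collections import deque

def layers_from_seed(adjacency, seed):
    """Return a list of layers; layers[l] holds the nodes at distance l from seed."""
    distance = {seed: 0}
    queue = deque([seed])
    layers = [[seed]]
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                if distance[neighbor] == len(layers):
                    layers.append([])      # first node found in a new layer
                layers[distance[neighbor]].append(neighbor)
                queue.append(neighbor)
    return layers

# Toy network; node 5 is isolated and is therefore never enumerated.
adjacency = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3], 5: []}
layers = layers_from_seed(adjacency, seed=0)
```

Nodes that never appear in any layer, such as node 5 here, lie outside the component containing the seed.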

Now we can ask a host of questions with consequences for the structure 
of the network whole: 

• How many nodes are there at distance $l$ from the seed $v_0$?

• What is the degree distribution within each layer? 

• What is the size of the giant component? 

• What is the degree distribution within the giant component versus 
outside the giant component? 

• What is the expected centrality of a seed $v_0$ picked at random in this
way? What about the centrality of a degree $k$ node?

All of these questions can be answered as outlined below. The method is
shown schematically in figure 1.

Let $S_l$ be the number of connections originating from layer $l$. For example,
for $l = 0$, $S_0$ is the degree of $v_0$. Let $R_l$ be the number of connections from
layer $l-1$ to layer $l$. Finally, let $T_l$ be the number of connections originating
from nodes outside of layers $m \leq l$.

Let $S_0 = z_0$, where $z_0$ is the average degree in the giant component of
the network². $T_0 = nz - z_0$, where $z$ is the average degree in the network
as a whole, and $R_0 = 0$. To continue mapping out the network, we need a

of the nodes in the network in the limit of large network size. 

2 We can choose any degree for our seed, though some of the statistics we derive will be
dependent on this parameter.

Figure 1: Schematic of the network tomographic method. This diagram
illustrates the tomographic method detailed in the text. Starting from a single
node $v_0$ we recursively explore nodes at distance $l$ from $v_0$. $R_l$ is the
number of connections going to layer $l$ from layer $l-1$. $S_l$ is the number of
connections to nodes in layer $l$. $T_l$ is the number of connections not connected
to nodes in layer $l$ or less. The importance of these quantities is explained
in the text.

recurrence relation on these quantities:

$$(S_{l+1}, R_{l+1}, T_{l+1}) = f(S_l, R_l, T_l)$$

To proceed further, and determine the exact form of $f(\cdot)$, we will need to
draw on a technique widely employed in the complex networks literature, the
probability generating function. Probability generating functions have found
numerous applications to the study of complex networks. The first examples
were given in [?]. A good general reference to generating function methods
is [?], and applications of generating functions to branching processes
are given in [?] and [?].

Probability generating functions are created by transformation of discrete
probability distributions into the space of polynomials. We will need just one
generating function, corresponding to our degree distribution:

$$g(x) = \sum_k p_k x^k \qquad (1)$$

Frequently we find that generating functions converge to simple algebraic
functions, in which case we can perform any operation on the algebraic
version of the generating function instead of the series expansion. This
constitutes one of the primary uses of probability generating functions.
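As a small numerical illustration of the series and closed-form views agreeing, the sketch below evaluates the generating function of a Poisson distribution with mean $z$ both as a truncated sum and through its known closed form $e^{z(x-1)}$. The helper names `pgf` and `pgf_prime` and the truncation at $k = 60$ are assumptions made for the example.

```python
import math

def pgf(p, x):
    """g(x) = sum_k p[k] * x**k for a degree distribution given as a list."""
    return sum(pk * x**k for k, pk in enumerate(p))

def pgf_prime(p, x):
    """g'(x) = sum_k k * p[k] * x**(k-1)."""
    return sum(k * pk * x**(k - 1) for k, pk in enumerate(p) if k > 0)

z = 3.0
# Poisson probabilities, truncated at k = 60 (an illustrative cutoff).
poisson = [math.exp(-z) * z**k / math.factorial(k) for k in range(61)]

x = 0.7
series_value = pgf(poisson, x)
closed_form = math.exp(z * (x - 1.0))   # known Poisson generating function
mean_degree = pgf_prime(poisson, 1.0)   # g'(1) is the mean degree
```

Differentiating at $x = 1$ recovers the mean degree, which is the operation used repeatedly in the derivations that follow.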

In the examples that follow we will concern ourselves with two easy to
study degree distributions:

1. Poisson. This is the degree distribution of classical random graphs
as studied by Erdős and Rapoport among others, $p_k = z^k e^{-z}/k!$. This is
generated by

$$g(x) = e^{z(x-1)} \qquad (2)$$

2. Exponential, $p_k = (1 - e^{-1/z}) e^{-k/z}$. This is generated by

$$g(x) = \frac{1 - e^{-1/z}}{1 - x e^{-1/z}} \qquad (3)$$

See [?] for a derivation of these generating functions.

Returning to the tomographic problem, consider the probability that a
connection emerging from layer $l$ will go to a node in layer $l+1$, given that
the connection does not go to layer $l-1$. Since our networks are completely
random, such a connection has uniform probability of going to any of the
"stubs" originating from nodes in layers $m > l$, as well as stubs originating
from nodes in layer $l$, minus those stubs which are already allotted to layer
$l-1$. This gives us the following:

$$\frac{T_l}{T_l + S_l - R_l}$$

For convenience, we now define the following quantity:

$$\alpha_l = \alpha_{l-1} \frac{T_l}{T_l + S_l - R_l}$$

This is the probability of a conjunction of events, namely that a connection
goes to a node outside of layer $l$, given that the connection has not attached
to layers $m < l$.

Note that the probability that a degree $k$ node lies outside the first $l$
layers is the probability that all $k$ of the node's connections go to other nodes
outside of layers $m < l$. This is simply $\alpha_l^k$.

Now it can be asked: What is the average degree of a node outside of
layers $m < l$? We have

$$\langle k \rangle_T = \sum_k \alpha_l^k p_k k / c \qquad (4)$$

where $c$ is the appropriate normalizing constant:

$$c = \sum_k \alpha_l^k p_k$$

The value of our generating function approach is now apparent, as we can
easily express the above in terms of our generating function $g(x)$:

$$\langle k \rangle_T = \left[\frac{d}{dx} g(\alpha_l x)\right]_{x=1} / g(\alpha_l) = \alpha_l g'(\alpha_l)/g(\alpha_l) \qquad (5)$$

By similar reasoning, the total number of connections originating from
nodes outside of layer $l+1$ is:

$$T_{l+1} = n\left[\frac{d}{dx} g(\alpha_l x)\right]_{x=1} = n \alpha_l g'(\alpha_l) \qquad (6)$$

Once this is known, $S$ and $R$ follow easily. $S$ is equivalent to the change
in the number of connections between two adjacent layers. $R$ will be the
expected number of connections going between two adjacent layers. We have:

$$S_{l+1} = T_l - T_{l+1}$$

$$R_{l+1} = S_l \frac{T_l}{T_l + S_l - R_l} = S_l \alpha_l / \alpha_{l-1}$$

This recurrence relation can be solved to any desired depth. Below it
will be shown that many interesting quantities can be computed from the
sequences of $S$, $T$, and $R$.³
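The recurrence can be iterated numerically. The following is a minimal sketch for a Poisson degree distribution, not a definitive implementation. It makes two assumptions that are not spelled out in the text: the iteration starts from $\alpha_{-1} = 1$, and the outward connections $R_{l+1}$ are computed from the $S_l - R_l$ stubs of layer $l$ that are not already attached to layer $l-1$, which keeps the denominator $T_l + S_l - R_l$ positive. The function name `tomography`, the stopping tolerance, and the reading of `sizes[i]` as the expected number of nodes at distance $i+1$ from the seed are likewise choices made for this example.

```python
import math

def tomography(n, z0, g, g_prime, max_layers=100, tol=1e-6):
    """Iterate the S, R, T recurrence; return alpha values and newly reached node counts."""
    z = g_prime(1.0)                # mean degree of the whole network
    S, R, T = z0, 0.0, n * z - z0   # initial values for layer 0
    alpha_prev = 1.0                # assumed starting value (alpha_{-1} = 1)
    alphas, sizes = [], []
    for _ in range(max_layers):
        free = S - R                    # stubs of this layer not tied to the previous one
        gamma = T / (T + free)          # chance a free stub attaches beyond this layer
        alpha = alpha_prev * gamma
        T_next = n * alpha * g_prime(alpha)          # equation (6)
        S_next = T - T_next
        R_next = free * gamma           # assumption: only free stubs can go outward
        new_nodes = n * (g(alpha_prev) - g(alpha))   # nodes first reached at this step
        alphas.append(alpha)
        sizes.append(new_nodes)
        if new_nodes < tol:
            break
        S, R, T, alpha_prev = S_next, R_next, T_next, alpha
    return alphas, sizes

z = 3.0
g = lambda x: math.exp(z * (x - 1.0))            # Poisson generating function
g_prime = lambda x: z * math.exp(z * (x - 1.0))
alphas, sizes = tomography(n=50000, z0=3.0, g=g, g_prime=g_prime)
```

With these choices the first entry of `sizes` is close to the seed degree, and the sequence of $\alpha$ values settles to a fixed value once no further nodes are reached.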



2.1 Descriptive statistics 

Let's return to the questions from section 2. With the simple recurrence
relation derived above we can now characterize many features of our network. Once a
sequence of values of $S_l$, $R_l$, and $T_l$ has been computed, it is quite simple
to determine many things about the structure of our network by plugging
the appropriate values into our generating functions.

Of foremost importance is the size of each layer, that is, the number of
nodes at some distance from our seed. We know that the probability of a
degree $k$ node being outside layer $l$ is $\alpha_l^k$. Then the probability of a degree $k$
node being within layer $l$ is $\alpha_{l-1}^k - \alpha_l^k$. So, choosing a node at random, the
probability of that node being in layer $l$ will be $\sum_k p_k(\alpha_{l-1}^k - \alpha_l^k)$. Translating
this into our generating function language, and multiplying by the population
size $n$, we have

$$n_l = n(g(\alpha_{l-1}) - g(\alpha_l)) \qquad (7)$$



3 It is worth noting that the recurrence relation on $S$, $T$, and $R$ can be simplified to a
recurrence relation on just two variables, because $S$ is not a function of itself. Specifically,
by eliminating $S$, we get

$$T_{l+1} = n \alpha_l g'(\alpha_l)$$

$$R_{l+1} = \frac{(T_{l-1} - T_l)\, T_l}{T_{l-1} - R_l}$$

and

$$\alpha_l = \alpha_{l-1} \frac{T_l}{T_{l-1} - R_l}$$

The size of the giant component is even easier to derive. Let $\alpha_\infty =
\lim_{l \to \infty} \alpha_l$⁴. This is the probability that a connection goes to a node at
distance infinity from the seed, or in other words is outside of the giant
component. The probability that a degree $k$ node is outside the giant component
is then $\alpha_\infty^k$. Following similar reasoning as above we find the size of the giant
component to be

$$n_{gc} = n(1 - g(\alpha_\infty)) \qquad (8)$$
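Footnote 4 identifies $\alpha_\infty$ with the quantity $u$ of Newman et al. One standard way to obtain $u$ is the self-consistency condition $u = g'(u)/g'(1)$. The sketch below solves that condition by fixed-point iteration instead of iterating the layer recurrence, so it is a swapped-in technique rather than the computation described in this paper. The function name and the starting value are assumptions for illustration.

```python
import math

def giant_component_fraction(g, g_prime, iterations=200):
    """Solve u = g'(u)/g'(1) by fixed-point iteration and return (u, 1 - g(u))."""
    mean_degree = g_prime(1.0)
    u = 0.5                      # illustrative starting value inside [0, 1)
    for _ in range(iterations):
        u = g_prime(u) / mean_degree
    return u, 1.0 - g(u)

z = 3.0
g = lambda x: math.exp(z * (x - 1.0))            # Poisson generating function
g_prime = lambda x: z * math.exp(z * (x - 1.0))
u, fraction = giant_component_fraction(g, g_prime)
n_gc = 50000 * fraction          # equation (8) with alpha_inf replaced by u
```

For the Poisson case with $z = 3$ this gives a giant component covering roughly 94% of the nodes.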

As we move outward from our seed, we find that the degree distribution
changes within each layer of the network. Initially the average degree tends to
increase, as nodes are reached with probability proportional to their degree.
But the high degree nodes are quickly exhausted, and the average degree within
a layer then decreases sharply.

In the $l$th layer the probability of a node having degree $k$ is given by

$$p_k^{(l)} = \frac{p_k}{c}\left(\alpha_{l-1}^k - \alpha_l^k\right) \qquad (9)$$

$$= \frac{p_k}{c}\,\alpha_{l-1}^k\left(1 - \left(\frac{\alpha_l}{\alpha_{l-1}}\right)^k\right) \qquad (10)$$

$$= \frac{p_k}{c}\,\alpha_{l-1}^k\left(1 - \left(1 - \frac{S_l - R_l}{T_l + S_l - R_l}\right)^k\right) \qquad (11)$$

where $c$ is the appropriate normalizing constant for the degree distribution.
When $\alpha$ is close to zero, it dominates the above expression, and thus the
distribution converges to a power law as we move away from the seed. Of
course, if $p_k$ decays faster than a power law (e.g. exponentially) then the
distribution will theoretically not have the "fat tails" characteristic of power
laws for large $k$. This happens regardless of the degree distribution of the
network as a whole.

Using reasoning identical to that used to determine the number of nodes in
layer $l$, we can determine the generating function for the degree distribution
in layer $l$:

$$g_l(x) = \frac{g(\alpha_{l-1} x) - g(\alpha_l x)}{g(\alpha_{l-1}) - g(\alpha_l)} \qquad (12)$$

Note that $g(\alpha_{l-1}) - g(\alpha_l)$ is in the denominator to normalize the distribution.

4 It is interesting to note that $\alpha_\infty$ corresponds to the probability of a connection not
being to the giant component, $u$, as derived by Newman et al. in [?]. The way that this
quantity is computed is somewhat different.

The degree distribution outside of the giant component is similarly easy
to derive:

$$g_{out}(x) = \frac{g(\alpha_\infty x)}{g(\alpha_\infty)} \qquad (13)$$

And the degree distribution within the giant component is the complement:

$$g_{gc}(x) = \frac{g(x) - g(\alpha_\infty x)}{1 - g(\alpha_\infty)} \qquad (14)$$

An important sociological consideration is the mean path length and the
associated closeness centrality statistic [?]. Having chosen a seed, we
can compute the average distance to other nodes in the network using the
quantities calculated above:

$$\langle l \rangle = \sum_{l \geq 1} l \, n_l / n_{gc} \qquad (15)$$

This can be considered the expected closeness centrality of a degree $z_0$ node
in the network, where $z_0$ is the degree of our seed.
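As a worked illustration of this average, the sketch below applies the sum to a list of layer sizes. The layer sizes are hypothetical, and the normalization by the total number of enumerated nodes (seed included) is an assumption about how $n_{gc}$ enters the formula.

```python
def mean_distance(layer_sizes):
    """Average distance from the seed, given layer_sizes[l] = nodes at distance l."""
    n_gc = sum(layer_sizes)      # all enumerated nodes, seed included
    weighted = sum(l * n_l for l, n_l in enumerate(layer_sizes))
    return weighted / n_gc

layer_sizes = [1, 3, 6, 2]       # hypothetical layer sizes for a small component
average = mean_distance(layer_sizes)
```

Here the weighted sum is $1 \cdot 3 + 2 \cdot 6 + 3 \cdot 2 = 21$ over 12 nodes, giving an average distance of 1.75.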



3 Theoretical Examples 

The reader may find it helpful if we illustrate the preceding ideas with a few 
simple, idealized examples. 

Many social networks fall into one of two regimes. The simplest case
is for the degree distribution to be relatively homogeneous, as occurs when
individuals connect to one another with uniform probability. This leads to
the classical random networks such as those studied by Rapoport and Erdős.
These are characterized by a symmetric, unimodal distribution, namely the
Poisson generated by equation (2). In the second regime, we find that a minority
of individuals act as "hubs" for the network, thereby accounting for the
great majority of connections in the network [2]. This leads to highly skewed
degree distributions such as power laws and simple exponentials. Although
highly idealized, both of these simple cases may have something to teach us
about the structure of real social networks.

We have explored both Poisson and exponential networks using simulation
and the tomographic methods discussed above. Consider the Poisson
degree distribution, with generating function (2). Let $n = 50000$.
By combining equations (2) and (12) we find that the degree distribution in
layer $l$ is generated by

$$g_l(x) = \frac{e^{z(\alpha_{l-1} x - 1)} - e^{z(\alpha_l x - 1)}}{e^{z(\alpha_{l-1} - 1)} - e^{z(\alpha_l - 1)}} \qquad (16)$$

$$= e^{z\alpha_{l-1}(x - 1)} \, \frac{1 - e^{z\alpha_{l-1} x(\gamma_l - 1)}}{1 - e^{z\alpha_{l-1}(\gamma_l - 1)}} \qquad (17)$$

where $\gamma_l = T_l/(T_l + S_l - R_l)$.



It can be verified that this satisfies the requirements for a probability
generating function, namely that it has a series expansion, and that $g_l(1) = 1$.
Figure 2 shows the degree distribution for $z = 3$ at various layers. The solid
lines represent the theoretical solutions given by the generating function above,
and the points, where present, mark the results of simulation. Forty networks of
size $n = 50000$ with Poisson degree distribution, $z = 3$, were generated. For each network
20 seeds were chosen independently, and the network was mapped out from
each. Averaging these simulations yields the data points shown.

Furthermore we can explore how the network changes its structure as the
mean of the degree distribution, $z$, is swept over a range of values. Figure 3
shows the results of one simulation where $z = 1.25, 3, 5$ and $n = 50000$ as
before. The average number of nodes at various distances from a randomly
chosen seed is shown. Dotted lines represent the results of simulations, while
the solid lines represent the theoretical prediction. The dotted line above the
theoretical prediction shows the 90th percentile among simulations. Likewise
the dotted line below shows the 10th percentile. It can be seen that our
theory correctly captures the trend as we increase $z$ from 1.25 to 5.

The theoretical prediction for figure 3 is derived by solving our generating
function (2) and using (7). We find:

$$n_l = n e^{z(\alpha_{l-1} - 1)}\left(1 - e^{z\alpha_{l-1}(\gamma_l - 1)}\right) \qquad (18)$$

where $\gamma_l = T_l/(T_l + S_l - R_l)$.

Figures 4 and 5 show identical experiments for the exponential degree
distribution (3). The mathematics is somewhat more tedious for this case, so
we omit it here.

Now viewing the results for the exponential and Poisson experiments,
several things bear mention. As we observed above, the degree distribution
converges to a skewed exponential or power law as we move to higher layers
in the network. This occurs despite the homogeneous degree distribution of
the Poisson networks. In fact, our theory predicts an exponential tail for both
of these distributions for high layers. However, we observe the "fat tails" of
power laws instead. This is most likely a finite-size effect.

The existence of hubs in the exponential networks leads to several
interesting differences from the Poisson networks. It can be seen from the $n_l$
experiments that the exponential has a narrower peak than the Poisson. As
soon as a path is found from $v_0$ to a hub, the rest of the network can be
reached in very few steps. It is also interesting that the degree distribution
for the exponential random networks has its mode shifted rightward
in the first several layers, thus making its distribution more reminiscent
of the Poisson. This is yet another consequence of the existence of hubs in
these networks; the higher mode bulge in these distributions represents the
existence of higher degree hubs a short distance from $v_0$.

Figure 2: Degree distribution within layers (panels for layer 3 and layer 7).
$n = 50000$, Poisson degree distribution, $z = 3$. Data points are
the average of 40 generated networks with 20 trials per network. Solid lines
represent the theoretical prediction given by (17).

Figure 3: $n_l$, the number of nodes within each layer. $n = 50000$, Poisson
degree distribution, $z = 1.25, 3, 5$. Data points
show the 10th and 90th percentile for 40 randomly generated networks with
20 trials per network. Solid lines represent the theoretical prediction given
by (18).

Figure 4: $n = 50000$, exponential degree distribution, $z = 3$. Data points
are the average of 40 generated networks with 20 trials per network. Solid
lines represent the theoretical prediction given by (11).

4 Email Network 

The ideas presented here can be illustrated with a real social network. The
network shown in figure 6 is the giant component for a one-day sample of
email traffic for individuals at Cornell University. This includes a diverse
collection of faculty, researchers, students and administrators. The commu- 
nication linking them is correspondingly diverse, motivated by work, research 
and social affiliation. 

In communication networks such as these, it is very important to develop
a sense of tie strength between individuals. This is particularly true for email
networks, where a great deal of communication does not indicate a meaningful
relationship but merely the spread of cheap information (e.g. "spam").
Fortunately, there is an easy way to distinguish genuine social affiliation from
simple information transfer. If persons in the network exchange emails in both
directions within the 24 hour sampling frame, that is a strong indication that
the correspondents are well acquainted and socially connected. We can then induce
a subnetwork by including only those ties which are reciprocal.
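The reciprocity filter can be sketched as follows. The record format (sender, recipient), the example addresses, and the function name `reciprocal_ties` are hypothetical; the actual preprocessing of the Cornell data is not described in the text.

```python
def reciprocal_ties(messages):
    """Keep undirected ties for which mail was sent in both directions."""
    directed = {(s, r) for s, r in messages if s != r}
    return {tuple(sorted(pair)) for pair in directed
            if (pair[1], pair[0]) in directed}

# Hypothetical (sender, recipient) records from one sampling window.
messages = [("ann", "bob"), ("bob", "ann"), ("ann", "cat"),
            ("list", "ann"), ("list", "bob"), ("cat", "dan"), ("dan", "cat")]
ties = reciprocal_ties(messages)
```

One-way traffic, such as the messages from the hypothetical `list` sender, produces no tie, which is the intended effect of the filter.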

In what follows, two networks will be considered. The first is the raw 
communication network, with no distinction made between reciprocal and 
non-reciprocal communication. For convenience, this will be referred to 
as the R/NR network. This network consists of 14216 nodes with 25040 
connections. The giant component of the network occupies 13577 of the 
nodes (95.5%).

Figure 5: $n = 50000$, exponential degree distribution, $z = 1.25, 3, 5$. Data
points show the 10th and 90th percentile for 40 randomly generated networks
with 20 trials per network. Solid lines represent the theoretical prediction.

Figure 6: The giant component from the Cornell email network. Connections
in the network represent reciprocal communication within a 24 hour sampling
frame. The nodes are color-coded. Blue nodes are faculty, red nodes are
graduate students, green nodes are undergraduates, and yellow nodes are
everyone else, mainly administrators. The network has 2607 nodes and 4838
connections. The giant component consists of 1227 nodes.

Figure 7: Degree distributions for the reciprocal and non-reciprocal email
networks (horizontal axis: degree). Solid lines show a fit designed to match the
average degree of the empirical distribution. The theoretical density is given
by equation (19).



The second network consists only of reciprocal email connections and the 
nodes which have such connections. This will be called the R network. This 
network is much smaller, consisting of only 2607 nodes with 4838 connections. 
The giant component occupies 1227 nodes (47.1%). 

The degree distributions for both the R and R/NR networks are shown
in figure 7. Both distributions are evidently power laws, as they lie
approximately on a straight line on log-log axes. The solid lines show a fit to
these data of a power law density with exponential cutoff:

$$p_k = \frac{k^{-\gamma} e^{-k/\kappa}}{\mathrm{Li}_\gamma(e^{-1/\kappa})}, \quad k \geq 1 \qquad (19)$$

where $\mathrm{Li}_n(x)$ is the $n$th polylogarithm of $x$. To apply the tomographic theory,
we need the generating function for this density. This is given by

$$g(x) = \mathrm{Li}_\gamma(x e^{-1/\kappa}) / \mathrm{Li}_\gamma(e^{-1/\kappa}). \qquad (20)$$
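For numerical work, the polylogarithm in (20) can be approximated by its defining series, $\mathrm{Li}_s(x) = \sum_{k \geq 1} x^k / k^s$, which converges for $|x| < 1$. The sketch below is illustrative: the truncation at 2000 terms and the values of $\gamma$ and $\kappa$ are assumptions and are not the parameters fitted to the email data.

```python
import math

def polylog(s, x, terms=2000):
    """Truncated series Li_s(x) = sum_{k>=1} x**k / k**s, for |x| < 1."""
    total, power = 0.0, 1.0
    for k in range(1, terms + 1):
        power *= x               # running value of x**k
        total += power / k**s
    return total

def cutoff_power_law_pgf(x, gamma, kappa):
    """g(x) for p_k proportional to k**(-gamma) * exp(-k / kappa), k >= 1."""
    decay = math.exp(-1.0 / kappa)
    return polylog(gamma, x * decay) / polylog(gamma, decay)

gamma, kappa = 2.0, 20.0         # illustrative values, not the fitted parameters
total = cutoff_power_law_pgf(1.0, gamma, kappa)
half = cutoff_power_law_pgf(0.5, gamma, kappa)
```

The check $g(1) = 1$ confirms that the density in (19) is properly normalized by the polylogarithm in its denominator.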

When applying the tomographic theory, it is possible to use the empirical
degree distribution, but as the theoretical distributions appear to fit the
empirical power laws very well, we will use the theoretical distributions instead.
Figure 8 shows the stratum sizes predicted for the R/NR network by the
tomographic theory (solid line). The dotted lines above and below the theoretical
prediction are the actual 90th and 10th percentile stratum sizes from the
R/NR network. The theory matches observations fairly well for the R/NR
network. A very different situation is illustrated by figure 9, which shows
the theoretical stratum sizes (solid line) alongside the mean stratum size
for the R network (dotted line). There is clearly a great deviation between
theory and observation. Nevertheless, this difference is instructive. The R
network shows only strong ties, in contrast to the R/NR network which
contains both strong and weak ties. Consequently, there are many more social
micro-structures in the R network than would be expected in a pure random
network. The clustering coefficient⁵, a measure of network transitivity, is
much greater for the R network ($C = 7.4\%$) than for the R/NR network
($C = 1.86\%$). Of course, in a pure random network of these sizes, $C \approx 0$.
Micro-structures such as these contribute to the deviations seen in figure 9
because they push the social network away from the pure random regime on
which the network tomographic theory is based. As shown in [27], clustering
has the effect of increasing mean path length and decreasing the giant
component size. This is why a more elongated series of stratum sizes is observed
in figure 9.

Figure 8: Theoretical (solid line) and empirical (dotted line) stratum sizes for
the R/NR email network. This network includes both reciprocal and
non-reciprocal communication within the 24 hour sampling frame. The upper
dotted line represents 90th percentile stratum sizes picking a seed from the
network uniformly at random. The lower dotted line represents the 10th
percentile.

5 The clustering coefficient, $C$, is defined as the ratio of the number of triads to the
number of potential triads in a network: $C = 3N_\Delta/N_3$, where $N_\Delta$ is the number of triads in
the network and $N_3$ is the number of connected triples of nodes. Note that in every triad
there are three connected triples.

Figure 9: Theoretical (solid line) and empirical (dotted line) stratum sizes
for the R email network. This network includes only reciprocal
communication within the 24 hour sampling frame. The dotted line represents the
mean empirical stratum size, selecting a seed from the network uniformly at
random.
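The clustering coefficient defined in footnote 5 can be computed directly from an adjacency structure by counting connected triples and checking which of them are closed. The sketch below assumes an undirected network stored as a dictionary of neighbor sets; the function name and the toy network are hypothetical.

```python
from itertools import combinations

def clustering_coefficient(adjacency):
    """C = 3 * (number of triangles) / (number of connected triples)."""
    triples = 0            # paths of length two, counted once per centre node
    closed = 0             # triples whose two end nodes are also linked
    for node, neighbors in adjacency.items():
        for a, b in combinations(sorted(neighbors), 2):
            triples += 1
            if b in adjacency[a]:
                closed += 1
    return closed / triples if triples else 0.0

# Toy network: a triangle 0-1-2 with a pendant node 3 attached to node 2.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
C = clustering_coefficient(adjacency)
```

Each triangle is counted once from each of its three corners, which supplies the factor of three in the definition. The toy network has five connected triples, three of them closed, so $C = 0.6$.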

5 Discussion 

The methods discussed here have relevance for disparate areas of networks 
research. 

Consider the problem of network sampling, the utilization of social networks
for surveying a population. Lately methods of chain-referral sampling
have been proposed [?] which model chain-referral samples as random
walks on social networks. In general, little is known about the attributes
of individuals reached after $n$ steps of such a random walk. Tomographic
methods may open a new window on the problem. We can now compute the
expected properties of a node at a given distance from our starting point, as
well as the probability that a random walk will be at that distance after a
given number of steps. This allows us to answer questions such as

• How many different nodes could possibly be reached after $n$ steps?

• What is the probability of the $n$th node in a chain-referral sample
having degree $k$?

• What is the probability of being at distance $l$ from our starting point
after $n$ steps?

It is beyond the scope of this paper to provide answers to these questions,
but it is certainly possible using network tomography.

Another potential application is to the study of network diffusion, the
study of dynamical processes which spread through a population via network
connections. Examples include the adoption of innovations [?] as well
as the spread of information or rumors [?]. The $\{n_l\}$ curves shown above
are highly reminiscent of birth and death processes such as the spread of an
epidemic through a population of susceptible individuals. In fact, the way we
have mapped out our network from a single node is somewhat like the way
an infectious agent may spread through a population from an initially infected
individual. Previous research [?] has investigated the structural properties of diffusion
of this sort, e.g. the proportion of the network that is ultimately occupied by
infected individuals. But it has been difficult to place a timescale on diffusion without
resorting to computer simulation. It is hoped that progress will soon be made
with the application of network tomography to these and related problems.

All of these results must be taken with the caveat that real networks may
not be organized as simple random networks. As mentioned above, there is
no guarantee that a real social network will exhibit the same sequences of $n_l$
or $p_k^{(l)}$ as in the random regime. Extra forces can shape the network topology
and push these statistics away from the pure random regime. These statistics
can be thought of as something that helps characterize the structure of the
network, like a fingerprint of its structure. When the statistics deviate from
the random regime, it is an indication that unique and potentially interesting
forces are affecting the network.

A simple example is furnished by the potential existence of greater than
random transitivity (i.e. triadic closure), which can certainly affect the
number of nodes at a given distance from our seed as well as the degree
distribution at that distance [23]. However, with more study it may even be
possible to adapt the tomographic method to account for transitivity and
other non-random structures within social networks.